Appendix B — Assignment B
Instructions
You may discuss the questions and potential directions for solving them with a friend. However, you must write your own solutions and code separately, not as a group activity.
Do not write your name on the assignment.
Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.
Use Quarto to print the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command:
quarto render filename.ipynb --to html. Submit the HTML file. The assignment is worth 100 points, and is due on Sunday, 23rd April 2023 at 11:59 pm.
Five points are for properly formatting the assignment. The breakdown is as follows:
- Must be an HTML file rendered using Quarto (2 pts). If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file. If your issue doesn’t seem genuine, you will lose points.
- There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
- Final answers of each question are written in Markdown cells (1 pt).
- There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)
- For all questions on cross-validation, you must use sklearn functions.
B.1 Degrees of freedom
Find the number of degrees of freedom of the following models. Exclude the intercept when counting the degrees of freedom. You may either show your calculation, or explain briefly how you are computing the degrees of freedom.
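As a reference for the counting itself (not the graded answers), the standard conventions can be encoded directly: a degree-d regression spline with K knots contributes d + K basis functions excluding the intercept, and a natural cubic spline with K knots spans a K-dimensional space (ESL 5.2.1), hence K - 1 degrees of freedom excluding the intercept. A minimal sketch under those conventions:

```python
def spline_df(degree, knots):
    """Degrees of freedom (excluding the intercept) of a regression
    spline of the given degree with the given number of knots:
    degree + knots."""
    return degree + knots

def natural_cubic_df(knots):
    """A natural cubic spline with K knots is represented by K basis
    functions (including the intercept), so K - 1 df excluding it."""
    return knots - 1

# Example: a cubic spline with 4 knots has 3 + 4 = 7 df
print(spline_df(3, 4))
print(natural_cubic_df(5))
```

Conventions differ across texts (some count boundary knots separately), so check them against the course's definitions before using them.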
B.1.1 Quadratic spline
A model with one predictor, where the predictor is transformed into a quadratic spline with 5 knots
(2 points)
B.1.2 Natural cubic splines
A model with one predictor, where the predictor is transformed into a natural cubic spline with 4 knots
(2 points)
B.1.3 Generalized additive model
A model with four predictors, where the transformations of the respective predictors are (i) cubic spline transformation with 3 knots, (ii) log transformation, (iii) linear spline transformation with 2 knots, (iv) polynomial transformation of degree 4.
(4 points)
B.2 Number of knots
Find the number of knots in the following spline transformations, if each transformation corresponds to 7 degrees of freedom (excluding the intercept).
B.2.1 Cubic splines
Cubic spline transformation
(1 point)
B.2.2 Natural cubic splines
Natural cubic spline transformation
(1 point)
B.2.3 Degree 4 spline
Spline transformation of degree 4
(1 point)
B.3 Regression problem
Read the file investment_clean_data.csv. This data is a cleaned version of the file train.csv from last quarter's regression prediction problem. Refer to the link for a description of the variables. It required some effort to get an RMSE of less than 650 with linear regression. In this question, we'll use MARS / natural cubic splines to get an RMSE of less than 350 with relatively little effort. Use mean squared error as the performance metric in cross-validation.
B.3.1 Data preparation
Prepare the data for modeling as follows:
- Use the Pandas function get_dummies() to convert all the categorical predictors to dummy variables.
- Using the sklearn function train_test_split, split the data into 20% test and 80% train. Use random_state = 45.
Note:
A. The function get_dummies() can be used over the entire DataFrame. Don’t convert the categorical variables individually.
B. The MARS model does not accept categorical predictors, which is why the conversion is done.
C. The response is money_made_inv
(2 points)
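A minimal sketch of this preparation step, using a small synthetic frame in place of investment_clean_data.csv (the file name and the response name money_made_inv come from the question; the toy columns here are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for pd.read_csv('investment_clean_data.csv')
data = pd.DataFrame({
    'loan_amnt': [1000, 2000, 1500, 3000, 2500, 1200],
    'term': ['36', '60', '36', '60', '36', '36'],   # a categorical predictor
    'money_made_inv': [100.0, 250.0, 180.0, 400.0, 320.0, 150.0],
})

# get_dummies() over the whole frame converts every object/category
# column at once; numeric columns pass through unchanged.
X = pd.get_dummies(data.drop(columns='money_made_inv'))
y = data['money_made_inv']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=45)
print(X_train.shape, X_test.shape)
```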
B.3.2 Optimal MARS degree
Use K-fold cross-validation to find the optimal degree of the MARS model to predict money_made_inv based on all the predictors in the dataset.
Hint: Start from degree 1, and keep going until it doesn’t benefit.
(4 points)
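The search can follow the usual cross_val_score pattern. The course presumably uses py-earth's Earth(max_degree=d) for MARS; since that package may not be available everywhere, this sketch substitutes a stand-in regressor with an analogous complexity knob — swap in the MARS estimator in your own solution:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor  # stand-in for pyearth.Earth

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=200)

best_degree, best_mse = None, np.inf
for degree in range(1, 5):
    # For MARS this would be: model = Earth(max_degree=degree)
    model = DecisionTreeRegressor(max_depth=degree, random_state=0)
    scores = cross_val_score(model, X, y, cv=5,
                             scoring='neg_mean_squared_error')
    mse = -scores.mean()          # sklearn returns the negated MSE
    if mse < best_mse:
        best_degree, best_mse = degree, mse
print(best_degree, best_mse)
```

Following the hint, stop increasing the degree once the cross-validated MSE stops improving.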
B.3.3 Fitting MARS model
With the optimal degree identified in the previous question, fit a MARS model. Print the model summary. What is the number of degrees of freedom of the model (excluding the intercept)?
(1 + 1 + 2 points)
B.3.4 Interpreting MARS basis functions
Based on the model summary in the previous question, answer the following question. Holding all other predictors constant, what will be the mean increase in money_made_inv for a unit increase in out_prncp_inv, given that out_prncp_inv is in [500, 600], term = 36 (months), loan_amnt = 1000, and int_rate = 0.1?
First, write the basis functions being used to answer the question, and then substitute the values.
Also, which basis function is non-zero over the smallest domain of out_prncp_inv? Specify the domain over which it is non-zero.
(3 + 2 points)
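MARS basis functions are hinge functions of the form max(0, x - t) or max(0, t - x), and the substitution the question asks for amounts to evaluating a product of hinges at the given values. The knots and coefficient below are made up for illustration, not taken from the actual model summary:

```python
def hinge(x, knot, direction=+1):
    """max(0, x - knot) if direction=+1, else max(0, knot - x)."""
    return max(0.0, direction * (x - knot))

# Hypothetical term: coef * h(out_prncp_inv - 450) * h(40 - term)
coef = 0.8
out_prncp_inv, term = 550, 36
value = coef * hinge(out_prncp_inv, 450) * hinge(term, 40, direction=-1)
print(value)  # 0.8 * 100 * 4 = 320.0
```

The mean increase per unit of out_prncp_inv is then the coefficient of out_prncp_inv in whichever basis functions are active (non-zero) over the stated interval.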
B.3.5 Feature importance
Find the relative importance of each predictor in the MARS model developed in B.3.3. You may choose any criterion for finding feature importance based on the MARS documentation. Print a DataFrame with 2 columns - one column consisting of predictors arranged in descending order of relative importance, and the second column quantifying their relative importance. Exclude predictors rejected by the model developed in B.3.3.
Note that the forward and backward passes of the algorithm perform feature selection without manual intervention.
(4 points)
B.3.6 Prediction
Using the model developed in B.3.3, compute the RMSE on test data.
(2 points)
Non-trivial train data
Let us call the part of the dataset where out_prncp_inv = 0 the trivial subset of the data. For this subset, we can directly predict the response without developing a model (recall the EDA from last quarter). For all the questions below, fit / tune the model only on the non-trivial part of the train data. However, when making predictions and computing the RMSE, consider the entire test data: combine the predictions of the model on the non-trivial subset of the test data with the direct predictions on the trivial subset to make predictions on the entire test data.
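Combining model predictions on the non-trivial rows with a direct prediction on the trivial rows can be sketched as follows. The toy data, the stand-in regressor, and the constant 0 predicted for trivial rows are placeholders; the question's EDA determines the actual trivial-subset prediction:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression  # stand-in for the MARS model
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
test = pd.DataFrame({'out_prncp_inv': [0, 120, 0, 300, 50],
                     'money_made_inv': [0.0, 60.0, 0.0, 150.0, 25.0]})
train = pd.DataFrame({'out_prncp_inv': rng.uniform(1, 400, 50)})
train['money_made_inv'] = 0.5 * train['out_prncp_inv']

# Fit only on the non-trivial train data (out_prncp_inv != 0 by construction)
model = LinearRegression().fit(train[['out_prncp_inv']], train['money_made_inv'])

nontrivial = test['out_prncp_inv'] != 0
pred = np.zeros(len(test))   # trivial rows: direct prediction (placeholder 0)
pred[nontrivial.values] = model.predict(test.loc[nontrivial, ['out_prncp_inv']])

# RMSE over the ENTIRE test data, trivial and non-trivial rows combined
rmse = np.sqrt(mean_squared_error(test['money_made_inv'], pred))
print(rmse)
```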
B.3.7 Prediction with non-trivial train data
Find the optimal degree of the MARS model based on the non-trivial train data, fit the model, and re-compute the RMSE on test data.
Note: You should get a lower RMSE than what you got in B.3.6.
(4 points)
B.3.8 Reducing model variance
The MARS model is highly flexible, which makes it a low-bias, high-variance model. However, high prediction variance increases the expected mean squared error on test data (see equation 2.7 on page 34 of the book). How can you reduce the prediction variance of the model without increasing the bias? Check slide 12 of the bias-variance presentation. The MARS model, in general, corresponds to case B. You can see that by averaging the predictions of multiple models, you will reduce prediction variance without increasing the bias.
Take 10 samples of the train data, each of the same size as the train data, with replacement. For each sample, fit a MARS model with the optimal degree identified earlier. Use the model, say $M_i$, to make a prediction on each test data point (note that predictions will be made using the model on the non-trivial test data, and without the model on the trivial test data). Compute the average prediction on each test data point based on the 10 models as follows:
Consider $\bar{y}_j = \frac{1}{10}\sum_{i=1}^{10}\hat{y}_{ij}$ as the prediction at the $j$th test data point, where $\hat{y}_{ij}$ is the prediction of model $M_i$ at that point. Compute the RMSE based on this averaged prediction, which is the average prediction of the 10 models. You should get a lower RMSE than in the previous question (B.3.7).
Note: For ease in grading, use the Pandas DataFrame method sample to take samples with replacement, and put random_state for the ith sample as i, where i goes from 0 to 9.
(6 points)
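The averaging scheme above can be sketched with the Pandas sample method, as the note prescribes. A stand-in regressor and toy data replace the tuned MARS model and the actual train/test split:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression  # stand-in for the MARS model

rng = np.random.default_rng(2)
train = pd.DataFrame({'x': rng.uniform(0, 10, 100)})
train['y'] = 2.0 * train['x'] + rng.normal(scale=0.5, size=100)
X_test = pd.DataFrame({'x': [1.0, 5.0, 9.0]})

preds = np.zeros((10, len(X_test)))
for i in range(10):
    # Bootstrap sample of the same size as the train data, random_state = i
    boot = train.sample(n=len(train), replace=True, random_state=i)
    model = LinearRegression().fit(boot[['x']], boot['y'])
    preds[i] = model.predict(X_test)

avg_pred = preds.mean(axis=0)   # average of the 10 models' predictions
print(avg_pred)
```

The RMSE would then be computed from avg_pred on the combined (trivial + non-trivial) test predictions, as in the earlier questions.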
B.3.9 Generalized additive model (GAM)
Develop a generalized additive model (GAM) to predict money_made_inv as follows:

$$\text{money\_made\_inv} = \beta_0 + \sum_{d} \beta_d f_d(\mathbf{X}) + \epsilon,$$

where $f_d(\mathbf{X})$ is a MARS model of degree $d$.
Print the estimated beta coefficients ($\beta_0, \beta_1, \dots$) of the developed model.
Note: The model is developed on the non-trivial train data
(8 points)
B.3.10 Prediction with GAM
Use the GAM developed in the previous question to compute RMSE on test data.
Note: Predictions will be made using the model on the non-trivial test data, and without the model on the trivial test data
(5 points)
B.3.11 Reducing GAM prediction variance
As we reduced the variance of the MARS model in B.3.8, follow the same approach to reduce the variance of the GAM developed in B.3.9, and compute the RMSE on test data.
Note: You should get a lower RMSE than what you got in B.3.10.
(8 points)
B.3.12 Natural cubic splines
Even though MARS is efficient and highly flexible, natural cubic splines work very well too, if tuned properly.
Consider the predictors identified in the model summary of the MARS model printed in B.3.3. For each predictor, create natural cubic spline basis functions with a given number of degrees of freedom. Include all-order interactions (i.e., 2-factor, 3-factor, 4-factor interactions, and so on) of all the basis functions. Use the sklearn function cross_val_score() to find and report the optimal degrees of freedom for the natural cubic spline of each predictor.
Consider degrees of freedom from 3 to 6 for the natural cubic spline transformation of each predictor.
(8 points)
B.3.13 Fitting the natural cubic splines model
With the optimal degrees of freedom identified in the previous question, fit a model to predict money_made_inv, where the basis functions correspond to the natural cubic splines of each predictor, and all-factor interactions of the basis functions. Compute the RMSE on test data.
Note: Predictions will be made using the model on the non-trivial test data, and without the model on the trivial test data
(4 points)
B.4 GAM for classification
The data for this question is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls, in which bank clients were called to subscribe for a term deposit.
There is one train dataset, train.csv, which you will use to develop a model. There are two test datasets, test1.csv and test2.csv, which you will use to test your model. Each dataset has the following attributes about the clients called in the marketing campaign:
- age: Age of the client
- education: Education level of the client
- day: Day of the month the call is made
- month: Month of the call
- y: Did the client subscribe to a term deposit?
- duration: Call duration, in seconds. This attribute highly affects the output target (e.g., if duration = 0 then y = 'no'). Yet, the duration is not known before a call is performed. Also, after the end of the call, y is obviously known. Thus, this input should only be included for inference purposes and should be discarded if the intention is to have a realistic predictive model.
(Raw data source: Source. Do not use the raw data source for this assignment. It is just for reference.)
Develop a generalized additive model (GAM) to predict the probability of a client subscribing to a term deposit based on age, education, day and month. The model must have:
(a) Minimum overall classification accuracy of 75% among the classification accuracies on train.csv, test1.csv and test2.csv.
(b) Minimum recall of 55% among the recall on train.csv, test1.csv and test2.csv.
Print the accuracy and recall for all the three datasets - train.csv, test1.csv and test2.csv.
Note that:
- You cannot use duration as a predictor. The predictor is not useful for prediction because its value is determined only after the marketing call ends, at which point we already know whether the client responded positively or negatively.
- One way to develop a model satisfying constraints (a) and (b) is to use spline transformations for age and day, and to interact month with all the predictors (including the spline transformations).
- You may assume that the distribution of the predictors is the same in all three datasets. Thus, you may create B-spline basis functions independently for the train and test datasets.
Use cross-validation on train data to optimize the model hyperparameters, and the decision threshold probability. Then, use the optimal hyperparameters to fit the model on train data. Then, evaluate its accuracy and recall on all the three datasets. Note that the test datasets must only be used to evaluate performance metrics, and not optimize any hyperparameters or decision threshold probability.
(20 points: 10 points for cross validation, 5 points for obtaining and showing the optimal values of the hyperparameters and decision threshold probability, 2 points for fitting the model with the optimal hyperparameters, and 3 points for printing the accuracy & recall on each of the three datasets)